Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection
In this paper, we introduce and evaluate PROPEDEUTICA, a novel methodology and framework for efficient and effective real-time malware detection that leverages the best of conventional machine learning (ML) and deep learning (DL) algorithms. In PROPEDEUTICA, every software process in the system starts execution subject to a conventional ML detector for fast classification. If a piece of software receives a borderline classification, it is subjected to further analysis via more computationally expensive and more accurate DL methods, using our newly proposed DL algorithm DEEPMALWARE. In addition, we introduce delays to the execution of software undergoing DL analysis as a way to "buy time" for the analysis and to rate-limit the impact of possible malware on the system. We evaluated PROPEDEUTICA with a set of 9,115 malware samples and 877 commonly used benign software samples from various categories for the Windows OS. Our results show that the false positive rate for conventional ML methods can reach 20%, whereas for modern DL methods it is usually below 6%; however, the classification time for DL can be 100x longer than for conventional ML methods. PROPEDEUTICA improved the detection F1-score from 77.54% (conventional ML method) to 90.25% and reduced the detection time by 54.86%. The percentage of software subjected to DL analysis was approximately 40% on average, and the application of delays to software undergoing DL analysis reduced the detection time by approximately 10%. Finally, we found and discuss a discrepancy between detection accuracy offline (analysis after all traces are collected) and on-the-fly (analysis in tandem with trace collection). Our insights show that neither conventional ML nor modern DL-based malware detectors in isolation can meet the needs of efficient and effective malware detection: high accuracy, a low false positive rate, and short classification time.
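To make the two-stage design concrete, here is a minimal Python sketch of the dispatch logic, assuming a fast screening classifier and an escalation threshold band; the names, thresholds, and detector interfaces are illustrative assumptions, not the authors' implementation:

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class Verdict:
        label: str        # "benign" or "malware"
        score: float      # estimated probability that the process is malicious
        escalated: bool   # True if the slow DL stage was consulted

    def classify(features: Sequence[float],
                 fast_ml: Callable[[Sequence[float]], float],   # fast, less accurate
                 deep_dl: Callable[[Sequence[float]], float],   # slow, more accurate
                 low: float = 0.3, high: float = 0.7) -> Verdict:
        """Screen with the cheap ML model; escalate borderline scores to DL."""
        score = fast_ml(features)
        escalated = low <= score <= high
        if escalated:
            # In the paper, the process would also be delayed (rate-limited)
            # here to "buy time" while the expensive DL analysis runs.
            score = deep_dl(features)
        return Verdict("malware" if score >= 0.5 else "benign", score, escalated)

Anything scoring outside the borderline band keeps the fast verdict, so only the ambiguous fraction of processes (roughly 40% in the paper's evaluation) pays the DL cost.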
Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction
Objective: To develop soft prompt-based learning algorithms for large language models (LLMs), and to examine the shape of prompts, prompt tuning with frozen and unfrozen LLMs, transfer learning, and few-shot learning abilities.
Methods: We developed a soft prompt-based LLM model and compared 4 training strategies: (1) fine-tuning without prompts; (2) hard prompts with unfrozen LLMs; (3) soft prompts with unfrozen LLMs; and (4) soft prompts with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting, and also assessed their few-shot learning ability.
Results and Conclusion: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6% to 3.1% and 1.2% to 2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2% to 2% and 0.6% to 11.7%, respectively. When LLMs are frozen, small LLMs (i.e., 345 million parameters) have a large gap to close to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institution evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot and transfer learning abilities, facilitating multi-institution applications, and (3) frozen LLMs require large model sizes to be competitive.
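As an illustration of strategy (4), the following PyTorch sketch prepends learnable soft-prompt embeddings to the inputs of a frozen backbone so that only the prompt parameters receive gradients; the class, hidden size, and prompt length are assumptions for exposition, not the paper's exact setup:

    import torch
    import torch.nn as nn

    class SoftPromptModel(nn.Module):
        """Wraps a frozen backbone; only the soft prompt embeddings train."""

        def __init__(self, backbone: nn.Module, hidden_size: int, prompt_len: int = 20):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False  # strategy (4): keep the LLM frozen
            # Learnable "soft prompt": prompt_len virtual token embeddings.
            self.soft_prompt = nn.Parameter(torch.randn(prompt_len, hidden_size) * 0.02)

        def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
            # input_embeds: (batch, seq_len, hidden) embedded input tokens.
            batch = input_embeds.size(0)
            prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
            # Prepend the virtual tokens and run the (frozen) backbone.
            return self.backbone(torch.cat([prompt, input_embeds], dim=1))

Because the backbone never changes, one frozen copy of the LLM can serve many tasks or institutions, with only the small prompt tensor trained and stored per task.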
On the Impact of Cross-Domain Data on German Language Models
Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset intended to contain high-quality data. By training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, with improvements over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
A Study of Generative Large Language Model for Medical Research and Healthcare
There is enormous enthusiasm, as well as concern, about using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, using 277 billion words of mixed clinical and English text with a GPT-3 architecture of 20 billion parameters. GatorTronGPT improves biomedical natural language processing for medical research. NLP models trained on synthetic text generated by GatorTronGPT outperform NLP models trained on real-world clinical text. A physician Turing test using a 1 (worst) to 9 (best) scale shows that there is no significant difference in linguistic readability (p = 0.22; 6.57 for GatorTronGPT vs. 6.93 for human text) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT vs. 6.97 for human text), and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
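To illustrate the synthetic-text idea in general terms, the sketch below samples clinical-style text from a generative LLM with the Hugging Face transformers API; the model id and prompt are placeholders, not the GatorTronGPT release:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    GEN_ID = "<generative-clinical-llm>"  # placeholder, not an official model id
    tok = AutoTokenizer.from_pretrained(GEN_ID)
    lm = AutoModelForCausalLM.from_pretrained(GEN_ID)

    prompt = "Chief complaint:"  # illustrative seed for one synthetic note
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=200, do_sample=True, top_p=0.9)
    synthetic_note = tok.decode(out[0], skip_special_tokens=True)
    # Notes generated this way would form a synthetic corpus on which a
    # downstream NLP model (e.g., for concept extraction) is then trained.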
Bear: A Framework for Understanding Application Sensitivity to OS (Mis)Behavior
The Application of Tomato Plant Residue Compost and Plant Growth-Promoting Rhizobacteria Improves Soil Quality and Enhances the Ginger Field Soil Bacterial Community
Treating and utilizing vegetable residues may reduce waste and improve rhizosphere soil. This study explored the effects of tomato plant residue compost and plant growth-promoting rhizobacteria (PGPR) on the physicochemical properties and microbial community of ginger field soil. Four treatments were applied: no compost or PGPR (CK), compost (TC), compost + Bacillus subtilis (TC-BS), and compost + Bacillus amyloliquefaciens SQR9 (TC-BA). The results showed that, compared with CK, TC significantly increased soil organic matter, alkali-hydrolyzable nitrogen, available phosphorus, and available potassium by 17.34%, 21.66%, 19.56%, and 37.20%, respectively. Soil urease, neutral phosphatase, and sucrase activities increased by 55.89%, 35.59%, and 57.21%, respectively. The abundances of Chloroflexi, Gemmatimonadetes, and Bacillus increased by 1.40%, 1.80%, and 0.68%, respectively, while Firmicutes decreased by 0.80%. TC-BS improved soil bacterial diversity significantly more than CK and TC, and beneficial Proteobacteria, Acidobacteria, Chloroflexi, and Bacillus dominated in relative abundance. Principal coordinate analysis revealed significant differences in bacterial community structure among the treatments. Redundancy analysis indicated total potassium (p = 0.002), pH (p = 0.0012), and available phosphorus (p = 0.016) as the main factors driving community composition. In conclusion, B. subtilis inoculation in ginger field soil supplemented with tomato compost enhanced bacterial diversity, altered bacterial community structure, enriched beneficial microorganisms, and promoted a healthy rhizosphere.
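For readers unfamiliar with the ordination used above, the following minimal NumPy sketch implements classical principal coordinate analysis (PCoA) on a toy distance matrix; the distances are invented stand-ins for, e.g., Bray-Curtis dissimilarities between the treatment communities:

    import numpy as np

    def pcoa(dist: np.ndarray, k: int = 2) -> np.ndarray:
        """Classical PCoA: double-center the squared distance matrix and
        project samples onto the k largest eigenvectors."""
        n = dist.shape[0]
        centerer = np.eye(n) - np.ones((n, n)) / n
        gower = -0.5 * centerer @ (dist ** 2) @ centerer
        vals, vecs = np.linalg.eigh(gower)          # eigenvalues, ascending
        top = np.argsort(vals)[::-1][:k]            # indices of the k largest
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

    # Toy distances among four samples (e.g., CK, TC, TC-BS, TC-BA communities).
    d = np.array([[0.0, 0.6, 0.7, 0.7],
                  [0.6, 0.0, 0.4, 0.3],
                  [0.7, 0.4, 0.0, 0.2],
                  [0.7, 0.3, 0.2, 0.0]])
    coords = pcoa(d)   # (4, 2) coordinates on the first two principal axes

Samples that plot far apart on these axes have dissimilar communities, which is how the treatment groups were separated in the analysis above.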
A large language model for electronic health records
There is increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems that use clinical narratives. However, there are few clinical language models, and the largest trained on clinical text is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model, GatorTron, using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data benefit these NLP tasks. GatorTron models scale the clinical language model from 110 million to 8.9 billion parameters and improve all five clinical NLP tasks (e.g., 9.6% and 9.5% accuracy improvements on NLI and MQA), which can be applied in medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og
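As a hedged illustration of how such an encoder could be applied to one of the listed tasks (clinical concept extraction framed as token classification), assuming a checkpoint in Hugging Face format; the model id and label count below are placeholders, not official identifiers:

    from transformers import AutoModelForTokenClassification, AutoTokenizer

    MODEL_ID = "<gatortron-checkpoint>"  # placeholder; see the NGC link above
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=5)

    text = "Patient denies chest pain but reports shortness of breath."
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits       # shape: (1, seq_len, num_labels)
    labels = logits.argmax(dim=-1)        # per-token label ids (e.g., BIO tags)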